This vignette describes the pmartR package functionality for quality control (QC) processing of mass spectrometry (MS) pan-omics data, in particular proteomics data at the peptide level. This includes data transformation, specification of groups that are to be compared against each other, filtering of features and/or samples, data normalization, and data summarization (correlation, principal components analysis). It is based on example data in the pmartRdata package.

Additional datasets available in the pmartRdata package include the following data types: protein level data, lipidomic data, and metabolomic data. A separate vignette describes the statistical analysis capabilities of pmartR.

1. Creating a data object (omicsData) in pmartR

The pmartR package relies on the creation of a standardized data object that is used for data manipulation. This data object requires specific formatting of input data and selection of data type to ensure that analyses are computed correctly. These data objects will be referred to overall as omicsData herein, where specific classes of omicsData will be referred to more specifically.

1a. Input data format

Below we load in 3 data.frames from the pmartRdata package. In this example, \(p\) is the number of unique peptides observed and \(n\) is the number of samples, respectively 17407 peptides and 12 samples.

  • pep_edata: A \(p * (n + 1)\) data.frame of expression data, where each row corresponds to data for each peptide and each column corresponds to a unique sample identifier. An additional peptide identifier/name column should also be present anywhere in the data.frame.

    The example denotes the unique peptide identifier/name column by ‘Mass_Tag_ID’ and all other column names are unique sample identifiers.

  • pep_fdata: A data.frame with \(n\) rows. Each row corresponds to a sample with one column giving the unique sample identifiers found in e_data column names and other columns providing qualitative and/or quantitative traits of each sample (e.g. experimental group, sample metadata, patient demographics).

    The example contains the sample identifier as “SampleID” and ‘Condition’ indicates the experimental group for each sample, either ‘Infection’ (samples that were infected with some pathogen) or ‘Mock’ (uninfected samples used as a control).

  • pep_emeta: An optional data.frame with at least \(p\) rows and a column with column name and entries identical to the unique identifier column in e_data. Each row corresponds to a peptide with one column giving peptide names and other columns giving meta information (e.g. mappings of peptides to proteins, class of molecule, pathways).

    The example pairs the peptide identifiers (‘Mass_Tag_ID’, corresponding with the same column in edata) with the protein identification name to which each mass tag ID maps (‘Protein’), a reference identifier (‘Ref_ID’), and the peptide sequence (‘Peptide_Sequence’). Note that the mass tag IDs are unique (\(n_{peptide} =\) 17407) while the protein identifiers are not (\(n_{protein} =\) 2774) since multiple peptides often map to the same protein.


pep_edata

data("pep_edata")    ## Load data.frame from pmartRdata
dim(pep_edata)       ## Print dimensions of the data.frame (rows, columns)
pep_edata[1:6,1:5]   ## Visualize the first 6 rows and first 5 columns of data
## [1] 17407    13
Mass_Tag_ID Infection1 Infection2 Infection3 Infection4
1024 NA NA NA NA
1047 17953839 20071472 20745779 18206556
1055 109536335 115459820 106127139 74522014
1104 1.752e+09 1.797e+09 1.703e+09 2.438e+09
1110 2571804 4269824 4852871 2630414
1214 110239193 82436688 100447189 1.02e+08

pep_fdata

data("pep_fdata")    ## Load data.frame from pmartRdata
dim(pep_fdata)       ## Print dimensions of the data.frame (rows, columns)
head(pep_fdata)      ## Visualize the first 6 rows of data
## [1] 12  2
SampleID Condition
Infection1 Infection
Infection2 Infection
Infection3 Infection
Infection4 Infection
Infection5 Infection
Infection6 Infection

pep_emeta

data("pep_emeta")    ## Load data.frame from pmartRdata
dim(pep_emeta)       ## Print dimensions of the data.frame (rows, columns)
head(pep_emeta)      ## Visualize the first 6 rows of data
## [1] 17407     4
Mass_Tag_ID Protein Ref_ID Peptide_Sequence
1024 ACBP_HUMAN 324 K.QATVGDINTERPGMLDFTGK.A
1047 ALBU_HUMAN 580 T.ECCHGDLLECADDR.A
1055 RL40_HUMAN 9898 K.TITLEVEPSDTIENVK.A
1104 ALBU_HUMAN 580 K.KVPQVSTPTLVEVSR.N
1110 CYC_HUMAN 2918 K.TGPNLHGLFGR.K
1214 G3P_HUMAN 4480 K.VGVNGFGR.I


1b. Selection of omicsData class

Several different classes of omicData are available in pmartR depending on if the user is conducting proteomic, lipidomic, or metabolomic analysis. They are classified as follows:

Proteomics (Peptides) - as.pepData(), as.isobaricpepData()

Proteomics (Proteins) - as.proData

Lipidomics - as.lipidData

Metabolomics - as.metabData()

Each of these data types have example data also included in the pmartRdata package in analogous formats to the peptide data described above. Different classes permit other class specific functions available in pmartR, including quantification of peptide-protein mappings using proteomic_filter(), normalizing proteomic data with spans_procedure, and quantifying proteoforms with bpquant(). For our example, as.pepData() is the appropriate choice and we will demonstrate each of these functions.


Feature note: Sample names beginning with numbers or containing dashes are modified in R when processed by read.csv() or data.frame() functions. This modification replaces dashes with periods and pastes an “X” before column names, throwing an error in when using the above functions since column names of e_data are not identical to f_data sample identifiers. Passing the argument check.names = FALSE will circumvent this issue; an example below demonstrates this instance using metabolomic data.


edata <- read.csv(system.file("extdata", 
                              "metab_edata_sample_names.csv", 
                              package="pmartRdata"), 
                  header=TRUE)

edata2 <- read.csv(system.file("extdata", 
                              "metab_edata_sample_names.csv", 
                              package="pmartRdata"), 
                   header=TRUE, check.names=FALSE)

fdata <- read.csv(system.file("extdata", 
                              "metab_fdata_sample_names.csv", 
                              package="pmartRdata"), 
                  header=TRUE)

all(names(edata)[-1] == fdata$SampleID) ## Do the sample names match each other?
## [1] FALSE
all(names(edata2)[-1] == fdata$SampleID) ## Do the sample names match each other?
## [1] TRUE

1c. Generate pepData

The pepData object contains the 3 components loaded above. During object creation, we specify the names of the identifier columns in the e_data, f_data, and e_meta components of the data, as well as the scale of the data (edata_cname = "Mass_Tag_ID", emeta_cname = "Protein", fdata_cname = "SampleID", data_scale = "abundance", respectively).


We can use the summary function to get basic summary information for our pepData object.

mypepData <- as.pepData(e_data = pep_edata, 
                        edata_cname = "Mass_Tag_ID", ## Unique biomolecule identifier
                        f_data = pep_fdata, 
                        fdata_cname = "SampleID",    ## Unique sample identifier
                        e_meta = pep_emeta, 
                        emeta_cname = "Protein",     ## Grouping identifier of interest
                        data_scale = "abundance")

class(mypepData)     ## pmartR class of data object
summary(mypepData)   ## pmartR summary of data object
## [1] "pepData"


Alternatively, the pmartRdata package contains the corresponding pepData object, which can be loaded as follows:

data("pep_object")  ## Load pre-rendered pmartR pepData object
class(pep_object)   ## pmartR class of data object
## [1] "pepData"
rm(pep_object)      ## remove data object (we use 'mypepData' herein)



We can use the plot() function to display box plots for each sample in the pepData object.

plot(mypepData)


Feature note: Long, inconvenient-for-plotting sample names can be modified using the function custom_sampnames(), where the identifiers can be trimmed at the user’s discretion (see “VizSampNames” column in code chunk below; all three commands generate the same trimming for the first 6 rows).


head(pmartR::custom_sampnames(mypepData, firstn = 3)$f_data)
head(pmartR::custom_sampnames(mypepData, from = 1, to = 3)$f_data)
head(pmartR::custom_sampnames(mypepData, delim = "e", components = 1)$f_data)
SampleID Condition VizSampNames
Infection1 Infection Inf
Infection2 Infection Inf
Infection3 Infection Inf
Infection4 Infection Inf
Infection5 Infection Inf
Infection6 Infection Inf

2. Workflow overview

Once the pepData object is created, a typical QC workflow follows the figure below.

Figure 1. Quality control and processing workflow in pmartR package.

Figure 1. Quality control and processing workflow in pmartR package.

We will walk through the steps in this QC workflow using mypepData. Statistics are covered in a separate vignette, ‘Statistical Analysis with the pmartR Package’.


3. Format & Preprocess

The formatting and preprocess steps of pmartR include zero-replacement, log transformation, and designating groups of interest to compare.

3a. Zero-replacement

Sometimes omics abundance data contains 0s when the particular biomolecule was not detected for a given sample. It is our practice to replace any 0s with NAs as opposed to imputing the missing values. We strongly prefer NAs to imputation because we cannot know whether the biomolecule is not present in the sample, simply was below the limit of detection, or was present but missing from our data at random. The function edata_replace() can be used to replace values in the pepData object.

mypepData <- edata_replace(mypepData, x = 0, y = NA)
## 0 instances of 0 have been replaced with NA

In this dataset, there were no 0s so nothing was replaced.


3b. Log transformation

We also highly recommend log transforming the data prior to analysis. pmartR supports log2, log10, and natural logarithm transformations. The edata_transform() function provides this capability. We can also use edata_trasnform() to transform the data back to the abundance scale if needed. Note that the scale of the data is stored and automatically updated in the data_info$data_scale attribute of the pepData object. Below we log10 transform the data, then return to the abundance scale, and finally settle on the log2 scale.

mypepData <- edata_transform(mypepData, data_scale = "log10")
attributes(mypepData)$data_info$data_scale
## [1] "log10"
mypepData <- edata_transform(mypepData, data_scale = "abundance")
attributes(mypepData)$data_info$data_scale
## [1] "abundance"
mypepData <- edata_transform(mypepData, data_scale = "log2")
attributes(mypepData)$data_info$data_scale
## [1] "log2"
plot(mypepData, bw_theme = TRUE)


Feature note: For isobaricpepData, please normalize using normalize_isobaric() before continuing. This function will normalize your data by a designated reference pool sample. If you have isobaric-labeled peptide data with no reference pool samples or have already accounted for reference samples, use as.pepData to read in your data.


3c. Group designation

Finally, we are preparing this data for statistical analysis where we will compare the samples belonging to one group to the samples belonging to another, and so we must specify the group membership of the samples. We do this using the group_designation() function, which modifies our pepData object and returns an updated version of it. Up to two main effects and up to two covariates may be specified (using covariates argument), with one main effect being required at minimum. The time_course argument permits users to designate a column containing time points. For the example data, we only have one option - specify the main effect to be “Condition” - since we do not have any additional information about the samples. Certain functions we will use below require that groups have been designated via the group_designation() function.

mypepData <- group_designation(mypepData, main_effects = "Condition", 
                               covariates = NULL, time_course = NULL)


The group_designation() function creates an attribute of the dataset as follows:

attributes(mypepData)$group_DF
plot(mypepData, color_by = "Condition", bw_theme=TRUE)


Feature note: Plotting in pmartR is automated for each omicsData object; further customization is available using the arguments color_by and order_by, allowing the user to select a single column from f_data to highlight changes or arrange sample order in the plot. Default plotting theme can also be altered by bw_theme = TRUE, equivalent to the effects of ggplot2 function theme_bw().


4. Filter Biomolecules

It is often good practice to filter out biomolecules (peptides in this case) that do not meet certain criteria, and pmartR offers 5 different filters. These are applied as follows:

Each of the filtering functions calculates metric(s) that can be used to filter out biomolecules and returns an S3 object separate from the omicsData object. Using the summary() function on the filter object produces a summary of the metric(s) and using the plot() function produces a graph. Filters that require a user-specified threshold to filter out peptides have corresponding summary and plot methods that take optional argument(s) to specify that threshold. Once one of the 5 peptide level filter functions has been called, the results of that function can be used in conjunction with the applyFilt() function to actually filter out peptides based on the metric(s) and user-specified threshold(s) and create a new, filtered data object.


4a. Proteomics Filter

The proteomics filter is applicable only to peptide level data that contains the e_meta component, as it counts the number of peptides that map to each protein and/or the number of proteins to which each individual peptide maps. It returns a list of two character vectors, the first, peptides_filt, giving degenerate peptide names. The second, proteins_filt, gives the names of proteins which no longer have peptides mapping to them in the dataset. This filter does not require a user-specified threshold.

myfilter <- proteomics_filter(mypepData)
summary(myfilter)
## 
## Summary of Proteomics Filter 
## 
##              Obs Proteins Per Peptide Obs Peptides Per Protein
## Min.                                1                        1
## 1st Qu.                             1                        1
## Median                              1                        3
## Mean                                1         6.27505407354001
## 3rd Qu.                             1                        7
## Max.                                1                      333
##                                                               
## Filtered                            0                        0
## Not Filtered                    17407                     2774
plot(myfilter, bw_theme=TRUE)


With the current dataset, the majority of proteins have one peptide mapping to them (graph on the left) and all peptides map to a single protein (e.g. no degenerate peptides; graph on the right). If filtering is necessary, refer to below example code for removal of degenerate peptides and requiring at least 10 peptides mapping to each protein:

mypepData_temp <- applyFilt(filter_object = myfilter, mypepData, 
                            degen_peps = TRUE,
                            min_num_peps = 10)
summary(mypepData_temp)
rm(mypepData_temp)


4b. Molecule Filter

Filters that require a user-specified threshold to filter out peptides have corresponding summary and plot methods that take optional argument(s) to specify that threshold.

The molecule filter allows the user to remove from the dataset any biomolecule not seen in at least min_num of the samples. The user may specify a threshold of the minimum number of times each biomolecule must be observed across all samples; the default value is 2.

myfilter <- molecule_filter(mypepData)
summary(myfilter, min_num = 3)
## 
## Summary of Molecule Filter 
## ----------------------------------
## Minimum Number:3
## Filtered:3342
## Not Filtered:14065
## 
##  num_observations frequency_counts
##                 1             1814
##                 2             3342
##                 3             4911
##                 4             5934
##                 5             6737
##                 6             7530
##                 7             8389
##                 8             9279
##                 9            10383
##                10            11527
##                11            12987
##                12            17407
plot(myfilter, bw_them = TRUE)


Setting the threshold to 3, we would filter 3,342 peptides out of the dataset. If we’d like to make the filter less stringent, we could use a threshold of 2 and only filter 1,872 peptides.

summary(myfilter, min_num = 2)
## 
## Summary of Molecule Filter 
## ----------------------------------
## Minimum Number:2
## Filtered:1814
## Not Filtered:15593
## 
##  num_observations frequency_counts
##                 1             1814
##                 2             3342
##                 3             4911
##                 4             5934
##                 5             6737
##                 6             7530
##                 7             8389
##                 8             9279
##                 9            10383
##                10            11527
##                11            12987
##                12            17407
mypepData <- applyFilt(filter_object = myfilter, mypepData, min_num = 2)
summary(mypepData)


4c. Coefficient of Variation Filter

The coefficient of variation (CV) filter calculates the pooled CV values as in (Ahmed 1995).

The user can then specify a CV threshold, above which peptides are removed.

myfilter <- cv_filter(mypepData)
summary(myfilter, cv_threshold = 150)
## 
## Summary of Coefficient of Variation (CV) Filter
## ----------------------
## CVs:
## 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.04778  24.85999  32.28337  34.83979  42.00244 168.69275 
## 
## Total NAs: 359
## Total Non-NAs: 15234 
## 
## Number Filtered Biomolecules: 2
plot(myfilter, cv_threshold = 150, title_size = 15, bw_theme = TRUE)

mypepData <- applyFilt(filter_object = myfilter, mypepData, cv_threshold = 150)
summary(mypepData)


4d. Custom Filter

Sometimes it is known a priori that certain peptides or samples should be filtered out of the dataset prior to statistical analysis. Perhaps there are known contaminant proteins, and so peptides mapping to them should be removed, or perhaps something went wrong during sample preparation for a particular sample. On the other hand, it may be preferred to specify peptides or samples to keep (removing those not explicitly specified), and this can also be accomplished. Keep in mind that both ‘remove’ and ‘keep’ arguments cannot be specified together; either ‘remove’ arguments only or ‘keep’ arguments only may be specified in a single call to custom_filter().

Here, we demonstrate the removal of the peptide with Mass Tag ID 1047 and sample Infection 1 as an example.

myfilter <- custom_filter(mypepData, 
                          e_data_remove = 1047, e_meta_remove = NULL, 
                          f_data_remove = "Infection1")
summary(myfilter)
## 
## Summary of Custom Filter
## 
##                       Filtered Remaining Total
## SampleIDs (f_data)           1        11    12
## Mass_Tag_IDs (e_data)        1     15590 15591
## Proteins (e_meta)            0      2511  2511
mypepData_temp <- applyFilt(filter_object = myfilter, mypepData)
summary(mypepData_temp)
rm(mypepData_temp)


We also demonstrate how to use this filter with the ‘keep’ option by keeping the samples Infection 1, Infection 2, Infection 3, Infection 4 and Infection 5.

myfilter2<- custom_filter(mypepData, 
                          e_data_keep = NULL, e_meta_keep = NULL, 
                          f_data_keep = c("Infection1", 
                                          "Infection2", 
                                          "Infection3", 
                                          "Infection4", 
                                          "Infection5"))
summary(myfilter2)
## 
## Summary of Custom Filter
## 
##                        Kept Discarded Total
## SampleIDs (f_data)        5         7    12
## Mass_Tag_IDs (e_data) 15591         0 15591
## Proteins (e_meta)      2511         0  2511
mypepData_temp2<- applyFilt(filter_object = myfilter2, mypepData)
summary(mypepData_temp2)
rm(mypepData_temp2)

Feature note: Note that there is a summary() method for objects of type custom_filt, but no plot() method.


4e. IMD-ANOVA Filter

The IMD-ANOVA filter removes peptides that do not have sufficient data for the statistical tests available in the pmartR package; these are ANOVA (quantitative test) and an independence of missing data (IMD) using a g-test (qualitative test) as specified in (Webb-Robertson et al. 2010). When using the summary(), plot(), and applyFilt() functions, you can specify just one filter (ANOVA or IMD) or both, depending on the tests you’d like to perform later. Using this filter speeds up the process of performing the statistical tests.

myfilter <- imdanova_filter(mypepData)


Here we consider what the filter would look like for both ANOVA and IMD.

summary(myfilter, min_nonmiss_anova = 2, min_nonmiss_gtest = 3)
## 
## Summary of IMD Filter
##                           
## Total Observations:  15591
## Filtered:             2052
## Not Filtered:        13539
# plot(myfilter, min_nonmiss_anova = 2, min_nonmiss_gtest = 3)


Here we consider what the filter would look like for just ANOVA.

summary(myfilter, min_nonmiss_anova = 2)
## 
## Summary of IMD Filter
##                           
## Total Observations:  15591
## Filtered:             5511
## Not Filtered:        10080
#plot(myfilter, min_nonmiss_anova = 2)


Here we consider what the filter would look like for just IMD.

summary(myfilter, min_nonmiss_gtest = 3)
## 
## Summary of IMD Filter
##                           
## Total Observations:  15591
## Filtered:             2280
## Not Filtered:        13311
#plot(myfilter, min_nonmiss_gtest = 3)


Now we apply the filter for both ANOVA and IMD.

mypepData <- applyFilt(filter_object = myfilter, 
                       mypepData, min_nonmiss_anova = 2, 
                       min_nonmiss_gtest = 3)

summary(mypepData)


5. Filter Samples

Sample filtering is encouraged in pmartR where samples are determined to be outliers, as these samples can impact further data analyses. Potential outlier detection is available in pmartR using the filter function rmd_filter(). Other functions in pmartR, dim_reduction() and cor_result() help the user determine if potential outliers are true outliers by visualizing groupings of samples and sample correlation trends.


5a. Potential outlier detection metrics

To identify any samples that are potential outliers or anomalies (due to sample quality, preparation, or processing circumstances), we use a robust Mahalanobis distance (rMd) (Matzke et al. 2011) score based on 2-5 metrics. The possible metrics are:

  • Correlation

  • Proportion of data that is missing (“Proportion_Missing”)

  • Median absolute deviation (“MAD”)

  • Skewness

  • Kurtosis


5b. Potential outlier identification

The rMd score can be mapped to a p-value, and a p-value threshold used to identify potentially outlying samples. In general, for proteomics data we recommend using correlation, proportion missing, MAD, and skew. A plot of the rMd values for each sample is generated, and specifying a value for ‘pvalue_threshold’ results in a horizontal line on the plot, with samples above the line (null hypothesis that the sample is not an outlier can be rejected) slated for filtering at the given threshold.

myfilter <- rmd_filter(mypepData, metrics = c("Correlation", 
                                              "Proportion_Missing", 
                                              "MAD", 
                                              "Skewness"))
summary(myfilter, pvalue_threshold = 0.001)
## 
## Summary of RMD Filter
## ----------------------
## P-values:
## 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000001 0.2686711 0.4152817 0.5050117 0.8306939 0.9707249 
## 
## Metrics Used: MAD, Skewness, Corr, Proportion_Missing 
## 
## Filtered Samples: Infection5, Infection8
plot(myfilter, pvalue_threshold = 0.001, bw_theme=TRUE)


Using the output from the summary() function, we can explore the outliers identified to see what metrics are driving their outlier-ness. Box plots for each metric are graphed, with the specified sample marked with an ‘X’.

plot(myfilter, sampleID = "Infection5", bw_theme=TRUE)

plot(myfilter, sampleID = "Infection8", bw_theme=TRUE)

We can see that Infection5 has the highest MAD, low skewness, and almost the lowest correlation, but the lowest proportion missing compared to the other samples. Infection 8 has MAD close to the median MAD value, the highest skew, the lowest correlation, and the highest proportion of missing data. We can use this information to determine whether to remove either or both of these samples. Sometimes it is also useful to look at additional data summaries to inform outlier removal, such as a principal components plot or correlation heat map.


6. Data summary

The pmartR package contains various methods for data summarization and exploration that can be used as part of the QC process: numeric summaries and associated plots (via the edata_summary() function), sequential projection pursuit principal components analysis (sppPCA) for dimension reduction (via thedim_reduction() function), and a correlation heat map (via the cor_result() function) (Stacklies et al. 2007), (Webb-Robertson et al. 2013). As for missing data summarization there are also functions to collect and plot missing data information. Missing data collection and plotting functions include, missingval_result(), plot_missingval(), missingval_scatterplot() and missingval_heatmap().


6a. Numeric Summaries

We can generate numeric summaries of our data by either sample or molecule. The argument groupvar can be used to specify if results should be split across an experimental group column. The edata_summary() function computes the mean, standard deviation, median, percent of observations for which a value was observed, the minimum value, and the maximum value.

The output of this function is a list of data.frames; we merge the list output here for conciseness.

listofsummary <- edata_summary(omicsData = mypepData, by = "sample", groupvar = NULL)

# Veiw as a single data.frame for vizualizing in rmd
dfsummary <- listofsummary[[1]]
for(i in 2:length(listofsummary)){
  dfsummary <- merge(dfsummary,listofsummary[[i]])
}
print(head(dfsummary))
##       sample     mean       sd   median   pct_obs      min      max
## 1 Infection1 21.78491 2.414556 21.75849 0.7965138 12.09441 30.94101
## 2 Infection2 21.89052 2.312247 21.83069 0.7900140 12.53455 30.92200
## 3 Infection3 21.85070 2.350433 21.79653 0.7303346 13.53600 30.94588
## 4 Infection4 21.81850 2.446171 21.79627 0.6917054 12.19476 31.18318
## 5 Infection5 22.22338 2.553529 22.26306 0.8023488 11.82416 31.74673
## 6 Infection6 21.87246 2.347379 21.80666 0.7264938 12.67596 31.37799


6b. Probabilistic Principal Components Analysis

Probabilistic is a principal components analysis (PCA) algorithm that can be used in the presence of missing data (Stacklies et al. 2007), (Webb-Robertson et al. 2013). We use the dim_reduction() function in pmartR, specifying the dataset and the number of principal components to return (default value of 2 principal components). The summary() and plot() functions operate on the results of the dim_reduction() function.

Note that dim_reduction() requires that the group_designation() function has been run.

pca_results <- dim_reduction(mypepData, k = 2)
# summary(pca_results) 
plot(pca_results, bw_theme=TRUE)


6c. Correlation Heatmap

Pearson correlation between samples is calculated based on pairwise complete biomolecules.

correlation_results <- cor_result(mypepData)
summary(correlation_results)
## Summary of Correlation Matrix (Upper Triangle)
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7804  0.8688  0.9143  0.9032  0.9496  0.9722
plot(correlation_results)


6d. Missing Data

Patterns of missing data can be explored using the missingval_result() and plot_missingval() functions.

results <- missingval_result(mypepData)
plot(results, type = "bySample", bw_theme=TRUE)

plot(results, type = "byMolecule", bw_theme=TRUE)

In addition, the missingval_scatterplot() and missingval_heatmap() functions provide more views of the missing data.

missingval_scatterplot(mypepData, palette = "Set1", bw_theme=TRUE)

missingval_heatmap(mypepData)


7. Normalize Data

The next step in a typical workflow is to normalize the data. Normalization can help reduce background noise while maintaining significant trends in datasets and often helps meet assumptions for statistical analysis.


7a. Normalization methods

Normalization approaches consist of a subset method and a normalization function (Åstrand 2003), (Gautier et al. 2004). Available subset methods include:

  • Percentage of peptides present (PPP): Subset the data to peptides present in at least \(p\) percent of samples. This subset method is specified as “ppp”.

  • Rank invariant peptides (RIP): First subset peptides to those present in every sample (e.g. complete peptides). Next, subject each peptide to a Kruskal Wallis test on group, and those peptides not significant at a given p-value threshold are retained as invariant. This subset method is specified as “rip”.

  • PPP-RIP: Rank invariant peptides among peptides present in a given percentage of samples. This subset method is specified as “ppp_rip”.

  • Top “l” order statistics (LOS): The peptides with intensities in the top “l” order statistics are retained. This subset method is specified as “los”.


Available normalization functions include:

  • Median centering: The sample-wise median (median of all peptides in a given sample) is subtracted from each peptide in the corresponding sample. This normalization method is specified as “median”.

  • Mean centering: The sample-wise mean (mean of all peptides in a given sample) is subtracted from each peptide in the corresponding sample. This normalization method is specified as “mean”.

  • Z-score transformation: The sample-wise mean (mean of all peptides in a given sample) is subtracted from each peptide, and the result is divided by the sample-wise standard deviation (standard deviation of all peptides in a given sample) in the corresponding sample. This normalization method is specified as “zscore”.

  • Median absolute deviation (MAD) transformation: The sample-wise median (median of all peptides in a given sample) is subtracted from each peptide, and the result is divided by the sample-wise MAD (MAD of all peptides in a given sample) in the corresponding sample. This normalization method is specified as “mad”.

There are two ways to go about normalizing data in the pmartR package. If you know the normalization approach you’d like to use, you can directly specify it in the normalize_data() function. Or, if you want to see how compatible different normalization approaches are with your dataset and you’re using proteomics data, you can use the spans_procedure() function.


7b. Normalization function

If we know what approach we want to use, we can go straight to the normalization.

# Normalize using all peptides and median centering #
norm_object <- normalize_global(omicsData = mypepData, 
                                subset_fn = "all", 
                                norm_fn = "median", 
                                apply_norm = FALSE, 
                                backtransform = TRUE)
norm_data <- normalize_global(omicsData = mypepData, 
                              subset_fn = "all", 
                              norm_fn = "median", 
                              apply_norm = TRUE, 
                              backtransform = TRUE)
plot(norm_data, 
     title_plot="Normalized Data: Median Centering Using all Peptides", 
     title_size=12,
     bw_theme = TRUE)

# Normalize using RIP 0.2 and median centering #
norm_object <- normalize_global(omicsData = mypepData, 
                                subset_fn = "rip", 
                                params=list(rip=0.2), 
                                norm_fn = "median", 
                                apply_norm = FALSE, 
                                backtransform = TRUE)
norm_data <- normalize_global(omicsData = mypepData, 
                              subset_fn = "rip", 
                              params=list(rip=0.2), 
                              norm_fn = "median", 
                              apply_norm = TRUE, 
                              backtransform = TRUE)
plot(norm_data, 
     title_plot="Normalized Data: Median Centering Using RIP 0.2 Peptides", 
     title_size=12,
     bw_theme = TRUE)

This is the same function that would be used post-SPANS if the user chooses to run SPANS on proteomic data.


7c. SPANS

If using proteomics data and unsure how to normalize, the spans_procedure() function will generate statistics the user can use to select a method. SPANS ranks different combinations of sub-setting and normalization methods based on a ‘score’ that captures how much bias a particular normalization procedure introduces into the data. A higher ‘score’ implies less bias. (Webb-Robertson et al. 2011)

SPANS tests bias for all the aforementioned sub-setting and normalization procedures by default. Default settings are used to determine the best normalization procedure for mypepData.

# returns a dataframe arranged by descending SPANS score
spans_result <- spans_procedure(mypepData)
## [1] "Finished creating background distribution, moving to method candidate selection"
## [1] "Finished method candidate selection, proceeding to score selected methods."
## [1] "Finished scoring selected methods"
summary(spans_result)
## 
## Summary of spans procedure
## 
## Highest ranked method(s)
##   subset_method normalization_method SPANS_score parameters
## 1           rip               median       0.999        0.1
## 2           rip               median       0.999       0.15
## 3           rip               median       0.999        0.2
## 4           rip               median       0.999       0.25
## 5       ppp_rip               median       0.999    0.1;0.1
## 6       ppp_rip               median       0.999  0.25;0.15
##   mols_used_in_norm passed_selection rank
## 1              1554             TRUE    1
## 2              1369             TRUE    1
## 3              1158             TRUE    1
## 4               987             TRUE    1
## 5              3195             TRUE    1
## 6              3195             TRUE    1
## 
## Number of input methods:  68
## Number of methods scored:  30
## Average molecules used in normalization:  5277
plot(spans_result)
# a list of the parameters for any normalization procedure with the best SPANS score
best_params <- get_spans_params(spans_result)

# there are a few ties, all using ppp_rip
# we'll select the method that uses median normalization and parameters ppp = 0.1 and rip = 0.1
subset_fn = best_params[[1]]$subset_fn
norm_fn = best_params[[1]]$norm_fn
params = best_params[[1]]$params

# we now pass the extracted subset function, normalization function, and parameters from SPANS to normalize_global()
norm_object <- normalize_global(omicsData = mypepData, 
                                subset_fn = subset_fn, 
                                norm_fn = norm_fn, 
                                params = params,
                                apply_norm = TRUE,
                                backtransform = TRUE)

Feature note: SPANS is restricted to peptide and protein data and works best when large datasets are available due to data sub-setting (though it does take some time). Errors may generate for certain procedures if the dataset size is insufficient. To fix this issue, try reducing the number of subset functions that are tested.


8. Protein Quantification

Protein quantification can be done using the protein_quant() function, either with or without accounting for different isoforms of the proteins, also called ‘proteoforms’. Regardless of whether the user is accounting for protein isoforms, they must specify a method for rolling peptides up to proteins (one of “rollup”, “rrollup”, “qrollup”, or “zrollup”) and a function to use for combining the peptide-level data (either “mean” or “median”). When the user does wish to account for protein isoforms, they have the option for doing so with Bayesian Proteoform Quantification (BP-Quant) (Webb-Robertson et al. 2014).


8a. Proteoform detection with BP-Quant (Webb-Robertson et al. 2014)

Proteoforms (including protein post-translational modification, splice variants, etc.) are important factors to consider when assessing the biological impact of experimental groups. BP-Quant predicts any proteoforms present for each protein group by identifying trends from quantitative (ANOVA) and qualitative (G-test for independence of missing data) tests across experimental group comparisons. These relationships are summarized in the format of a vector of “signatures”, where 0 indicates no statistically significant difference between experimental groups and 1 or -1 respectively represent a statistically significant increased or decreased value between experimental groups. BPQuant uses these signatures in the Bayesian model to select the set of proteoforms with the maximum likelihood for each protein (Webb-Robertson et al. 2014).

In pmartR, this method is available using the function bpquant(), demonstrated below. Note that this function takes advantage of statistical results generated from pmartR using the function imdanova(), further covered in the vignette ‘Statistical Analysis with the pmartR Package’.

The output of this function produced a list, where each element contains a data.frame defining proteoform identifiers for each peptide mapping to that protein. Where proteoform ID is zero, the peptide does not fit in the dominant trends identified for that protein.

peptideStats <- imd_anova(norm_object, 
                          test_method = "combined", 
                          pval_thresh = 0.05, 
                          pval_adjust = 'bonferroni')

proteoforms <- bpquant(peptideStats, norm_object)

proteoforms[[49]] ## An example of an output element

Feature note: Where there is insufficient data for either statistical test, the peptide abundances cannot be converted to a signature. The imdanova_filter will help to determine if this is the case for any dataset!


8b. Rollup Methods

The rollup method takes either the mean or median of all peptides mapping to a given protein, and sets that value as the relative protein abundance.

In the rrollup method, all peptides that map to a single protein are scaled based on a chosen reference peptide, which is the peptide with most observations. Next the average or mean of the scaled peptides is set as the relative protein abundance. (Matzke et al. 2013)

The qrollup method starts with all peptides that map to a single protein. Next peptides are chosen according to an abundance cutoff value and the average or mean of the scaled peptides is set as the protein abundance. (Polpitiya et al. 2008)

In the zrollup method, scaling is done similarly to the z-score formula (except with medians instead of means). The scaling formula is applied to peptides that map to a single protein and then the mean or median of the scaled peptides is set as protein abundance (from DAnTE article). The rollup method is similar to rrollup method, except there is no scaling involved in these methods. Either the mean or median is applied to all peptides that map to a single protein to obtain protein abundance. (Polpitiya et al. 2008)


The protein_quant() function returns a pmartR data object of class proData.

myprodata_rrollup<- pmartR::protein_quant(pepData = norm_object, method = "rrollup")
print(myprodata_rrollup)
## only first 5 columns are shown
## e_data
##                           Protein       Infection1       Infection2
## 1                      ALBU_HUMAN 30.7284373069703 30.6093269371943
## 2                      RL40_HUMAN 23.8194727741833 23.7523221873391
## 3                       CYC_HUMAN 22.4898474100337 22.3580306811018
## 4                       G3P_HUMAN 28.8165894353077 28.4266412920063
## 5                            ----             ----             ----
## 2250                  GSDMB_HUMAN             <NA>             <NA>
## 2251 gi|29836496|ref|NP_828851.1| 25.7649605608504 25.6610986933972
## 2252                 H1N1_CA04_NP             <NA>  19.431129958381
## 2253                  NXPH2_HUMAN 19.6465108627434 17.6474393552717
##            Infection3       Infection4
## 1    30.6317989290622 31.2311848782248
## 2    23.8767966168028 23.6976560416636
## 3    22.6535079624622 21.9723492626176
## 4     28.580536128935 28.6681866136105
## 5                ----             ----
## 2250 20.2167998857096             <NA>
## 2251 25.6180431466804 25.5663956728011
## 2252             <NA>             <NA>
## 2253             <NA>   18.46654128675
## 
## f_data
##      SampleID Condition
## 1  Infection1 Infection
## 2  Infection2 Infection
## 3  Infection3 Infection
## 4  Infection4 Infection
## 5        ----      ----
## 9  Infection9 Infection
## 10      Mock1      Mock
## 11      Mock2      Mock
## 12      Mock3      Mock
## 
## e_meta
##                           Protein Ref_ID      Peptide_Sequence
## 1                      ALBU_HUMAN    580    T.ECCHGDLLECADDR.A
## 2                      RL40_HUMAN   9898  K.TITLEVEPSDTIENVK.A
## 3                       CYC_HUMAN   2918       K.TGPNLHGLFGR.K
## 4                       G3P_HUMAN   4480          K.VGVNGFGR.I
## 5                            ----   ----                  ----
## 2250                  GSDMB_HUMAN   4995    K.CLGKEDIRQDLEQR.V
## 2251 gi|29836496|ref|NP_828851.1|  22574        K.EIDRLNEVAK.N
## 2252                 H1N1_CA04_NP  22287      R.EGYSLVGIDPFK.L
## 2253                  NXPH2_HUMAN   8006 K.ICYQEQTQSHVSWLCSK.P
##      peps_per_pro n_peps_used
## 1               5           5
## 2               8           8
## 3               6           6
## 4              49          49
## 5            ----        ----
## 2250            1           1
## 2251            2           2
## 2252            1           1
## 2253            1           1
rm(myprodata_rrollup)


References

Ahmed, SE. 1995. “A Pooling Methodology for Coefficient of Variation.” Sankhyā: The Indian Journal of Statistics, Series B, 57–75.

Åstrand, Magnus. 2003. “Contrast Normalization of Oligonucleotide Arrays.” Journal of Computational Biology 10 (1): 95–102.

Gautier, Laurent, Leslie Cope, Benjamin M Bolstad, and Rafael A Irizarry. 2004. “Affy—Analysis of Affymetrix Genechip Data at the Probe Level.” Bioinformatics 20 (3): 307–15.

Matzke, Melissa M, Joseph N Brown, Marina A Gritsenko, Thomas O Metz, Joel G Pounds, Karin D Rodland, Anil K Shukla, et al. 2013. “A Comparative Analysis of Computational Approaches to Relative Protein Quantification Using Peptide Peak Intensities in Label-Free Lc-Ms Proteomics Experiments.” Proteomics 13 (3-4): 493–503.

Matzke, Melissa M, Katrina M Waters, Thomas O Metz, Jon M Jacobs, Amy C Sims, Ralph S Baric, Joel G Pounds, and Bobbie-Jo M Webb-Robertson. 2011. “Improved Quality Control Processing of Peptide-Centric Lc-Ms Proteomics Data.” Bioinformatics 27 (20): 2866–72.

Polpitiya, Ashoka D, Wei-Jun Qian, Navdeep Jaitly, Vladislav A Petyuk, Joshua N Adkins, David G Camp, Gordon A Anderson, and Richard D Smith. 2008. “DAnTE: A Statistical Tool for Quantitative Analysis of-Omics Data.” Bioinformatics 24 (13): 1556–8.

Stacklies, Wolfram, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. 2007. “PcaMethods—a Bioconductor Package Providing Pca Methods for Incomplete Data.” Bioinformatics 23 (9): 1164–7.

Webb-Robertson, Bobbie-Jo M, Melissa M Matzke, Susmita Datta, Samuel H Payne, Jiyun Kang, Lisa M Bramer, Carrie D Nicora, et al. 2014. “Bayesian Proteoform Modeling Improves Protein Quantification of Global Proteomic Measurements.” Molecular & Cellular Proteomics 13 (12): 3639–46.

Webb-Robertson, Bobbie-Jo M, Melissa M Matzke, Jon M Jacobs, Joel G Pounds, and Katrina M Waters. 2011. “A Statistical Selection Strategy for Normalization Procedures in Lc-Ms Proteomics Experiments Through Dataset-Dependent Ranking of Normalization Scaling Factors.” Proteomics 11 (24): 4736–41.

Webb-Robertson, Bobbie-Jo M, Melissa M Matzke, Thomas O Metz, Jason E McDermott, Hyunjoo Walker, Karin D Rodland, Joel G Pounds, and Katrina M Waters. 2013. “Sequential Projection Pursuit Principal Component Analysis–Dealing with Missing Data Associated with New-Omics Technologies.” Biotechniques 54 (3): 165–68.

Webb-Robertson, Bobbie-Jo M, Lee Ann McCue, Katrina M Waters, Melissa M Matzke, Jon M Jacobs, Thomas O Metz, Susan M Varnum, and Joel G Pounds. 2010. “Combined Statistical Analyses of Peptide Intensities and Peptide Occurrences Improves Identification of Significant Peptides from Ms-Based Proteomics Data.” Journal of Proteome Research 9 (11): 5748–56.